py_tutorial_reading (Score: 30.5 / 38.0)

  1. Test cell (Score: 1.0 / 1.0)
  2. Test cell (Score: 1.0 / 1.0)
  3. Test cell (Score: 1.0 / 1.0)
  4. Test cell (Score: 1.0 / 1.0)
  5. Test cell (Score: 1.0 / 1.0)
  6. Test cell (Score: 1.0 / 1.0)
  7. Written response (Score: 3.0 / 3.0)
  8. Test cell (Score: 3.0 / 3.0)
  9. Written response (Score: 0.0 / 3.0)
  10. Comment
  11. Test cell (Score: 1.0 / 1.0)
  12. Test cell (Score: 1.0 / 1.0)
  13. Test cell (Score: 1.0 / 1.0)
  14. Test cell (Score: 1.0 / 1.0)
  15. Written response (Score: 3.0 / 3.0)
  16. Comment
  17. Test cell (Score: 1.0 / 3.0)
  18. Written response (Score: 3.0 / 3.0)
  19. Written response (Score: 3.0 / 3.0)
  20. Test cell (Score: 1.0 / 1.0)
  21. Test cell (Score: 1.0 / 1.0)
  22. Test cell (Score: 1.0 / 1.0)
  23. Written response (Score: 0.5 / 3.0)
  24. Test cell (Score: 1.0 / 1.0)
  25. Comment

Tutorial: Reading in data locally and from the web

This worksheet covers the Reading in data locally and from the web chapter of the online textbook, which also lists the learning objectives for this worksheet. You should read the textbook chapter before attempting this worksheet.

In [1]:
### Run this cell before continuing.
import os

import altair as alt
import pandas as pd

# Simplify working with large datasets in Altair
# You can read the following URL if you are interested in further
# understanding large datasets in Altair but it is not required 
# https://altair-viz.github.io/user_guide/large_datasets.html
alt.data_transformers.enable('vegafusion')
Out[1]:
DataTransformerRegistry.enable('vegafusion')
In [2]:
### Run this cell before continuing.
try:
    os.remove("data/WHR2018Chapter2OnlineData.xls")
except FileNotFoundError:
    pass
In [3]:
### Run this cell before continuing.
try:
    os.remove("data/project_data.csv")
except FileNotFoundError:
    pass

1. Happiness Report

As you might remember from worksheet_02, we practised loading data from the Sustainable Development Solutions Network's World Happiness Report. That data was the output of their analysis that calculated each country's happiness score and how much each variable contributed to it. In this tutorial, we are going to look at the data at an earlier stage of the study: the aggregated/averaged values (per country and year) for many different social and health aspects that the researchers anticipated might contribute to happiness (Table 2.1 from this Excel spreadsheet).

The goal for today is to produce a plot of 2017's positive affect scores against healthy life expectancy at birth, with healthy life expectancy at birth on the x-axis and positive affect on the y-axis. For this study, positive affect was defined as the average of three positive affect measures: happiness, laughter and enjoyment. We would also like to convert the positive affect score from a scale of 0-1 to a scale of 0-10.

  1. use the [] notation to subset the rows where the year is equal to 2017
  2. use assign to convert the "Positive affect" score from a scale of 0 - 1 to a scale from 0 - 10
  3. use [] to choose the "Healthy life expectancy at birth" column and the scaled "Positive affect" column
  4. use altair to create a plot of "Positive affect" (y-axis) versus "Healthy life expectancy at birth" (x-axis)
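The four steps above can be sketched on a tiny made-up data frame (the rows and values below are invented for illustration; the real happiness data has many more columns):

```python
import pandas as pd

# Invented stand-in for the happiness data
toy = pd.DataFrame({
    "year": [2016, 2017, 2017],
    "Positive_affect": [0.6, 0.5, 0.75],  # 0-1 scale
    "Healthy_life_expectancy_at_birth": [60.0, 65.0, 70.0],
})

# 1. [] with a logical condition keeps only the 2017 rows
step1 = toy[toy["year"] == 2017]

# 2. assign adds a rescaled 0-10 version of the score
step2 = step1.assign(Positive_affect_scaled=step1["Positive_affect"] * 10)

# 3. [] with a list of names keeps only the columns needed for the plot
reduced = step2[["Positive_affect_scaled", "Healthy_life_expectancy_at_birth"]]
```

Step 4 would then pass `reduced` to `alt.Chart(...)`, just as in worksheet_02.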

Tips for success: Try going through all of the steps on your own, but don't forget to discuss with others (classmates, TAs, or an instructor) if you get stuck. If something is wrong and you can't spot the issue, be sure to read the error message carefully. Since there are a lot of steps involved in working with data and modifying it, feel free to look back at worksheet_02.

Question 1.1 Multiple Choice:
{points: 1}

What is the maximum value for the "Positive affect" score (in the original data file that you read into Python)?

A. 100

B. 10

C. 1

D. 0.1

E. 5

Assign your answer to an object called answer1_1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [4]:
Student's answer
answer1_1 = "C"
In [5]:
Grade cell: cell-f1313c7d1348ac8c Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"dc859").hexdigest() == "d5153f09cd7d1c205729942551f9f22d3466df48", "type of answer1_1 is not str. answer1_1 should be an str"
assert sha1(str(len(answer1_1)).encode("utf-8")+b"dc859").hexdigest() == "fcab68f0dd1c77aee35013660b7506642e5aaa84", "length of answer1_1 is not correct"
assert sha1(str(answer1_1.lower()).encode("utf-8")+b"dc859").hexdigest() == "bba28be227ab4d2cc557066e987357b6a8d14c39", "value of answer1_1 is not correct"
assert sha1(str(answer1_1).encode("utf-8")+b"dc859").hexdigest() == "b40cf96465d9c9019579834a41a30f0cf70a680f", "correct string value of answer1_1 but incorrect case of letters"

print('Success!')
Success!

Question 1.2 Multiple Choice:
{points: 1}

Which column's values will be used to filter the data?

A. countries

B. generosity

C. positive affect

D. year

Assign your answer to an object called answer1_2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [6]:
Student's answer
answer1_2 = "D"
In [7]:
Grade cell: cell-8bd586d238ebce67 Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(answer1_2)).encode("utf-8")+b"5c19e").hexdigest() == "7593227dbe13c5ad69ed738f93361e688c8de19a", "type of answer1_2 is not str. answer1_2 should be an str"
assert sha1(str(len(answer1_2)).encode("utf-8")+b"5c19e").hexdigest() == "897d485e1521ff61d5b1fd8cd7efaa7ffb17d57c", "length of answer1_2 is not correct"
assert sha1(str(answer1_2.lower()).encode("utf-8")+b"5c19e").hexdigest() == "22c5ae2640573b21fd1e4646b9569ce24438e574", "value of answer1_2 is not correct"
assert sha1(str(answer1_2).encode("utf-8")+b"5c19e").hexdigest() == "d6fe4ff1f2e336c58c3e84a1e92e3c2602b3fe82", "correct string value of answer1_2 but incorrect case of letters"

print('Success!')
Success!

Question 1.3.0
{points: 1}

Use the appropriate .read_* function to read in the WHR2018Chapter2OnlineData (look in the tutorial_02 directory to ensure you use the correct relative path to read it in).

Assign the data frame to an object called happy_df_csv.

In [8]:
Student's answer
happy_df_csv = pd.read_csv("data/WHR2018Chapter2OnlineData.csv")
happy_df_csv
Out[8]:
country year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect Confidence in national government Democratic Quality Delivery Quality Standard deviation of ladder by country-year Standard deviation/Mean of ladder by country-year GINI index (World Bank estimate) GINI index (World Bank estimate), average 2000-15 gini of household income reported in Gallup, by wp5-year
0 Afghanistan 2008 3.723590 7.168690 0.450662 49.209663 0.718114 0.181819 0.881686 0.517637 0.258195 0.612072 -1.929690 -1.655084 1.774662 0.476600 NaN NaN NaN
1 Afghanistan 2009 4.401778 7.333790 0.552308 49.624432 0.678896 0.203614 0.850035 0.583926 0.237092 0.611545 -2.044093 -1.635025 1.722688 0.391362 NaN NaN 0.441906
2 Afghanistan 2010 4.758381 7.386629 0.539075 50.008961 0.600127 0.137630 0.706766 0.618265 0.275324 0.299357 -1.991810 -1.617176 1.878622 0.394803 NaN NaN 0.327318
3 Afghanistan 2011 3.831719 7.415019 0.521104 50.367298 0.495901 0.175329 0.731109 0.611387 0.267175 0.307386 -1.919018 -1.616221 1.785360 0.465942 NaN NaN 0.336764
4 Afghanistan 2012 3.782938 7.517126 0.520637 50.709263 0.530935 0.247159 0.775620 0.710385 0.267919 0.435440 -1.842996 -1.404078 1.798283 0.475367 NaN NaN 0.344540
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1557 Zimbabwe 2013 4.690188 7.565154 0.799274 48.949745 0.575884 -0.076716 0.830937 0.711885 0.182288 0.527755 -1.026085 -1.526321 1.964805 0.418918 NaN 0.432 0.555439
1558 Zimbabwe 2014 4.184451 7.562753 0.765839 50.051235 0.642034 -0.045885 0.820217 0.725214 0.239111 0.566209 -0.985267 -1.484067 2.079248 0.496899 NaN 0.432 0.601080
1559 Zimbabwe 2015 3.703191 7.556052 0.735800 50.925652 0.667193 -0.094585 0.810457 0.715079 0.178861 0.590012 -0.893078 -1.357514 2.198865 0.593776 NaN 0.432 0.655137
1560 Zimbabwe 2016 3.735400 7.538829 0.768425 51.800068 0.732971 -0.065283 0.723612 0.737636 0.208555 0.699344 -0.863044 -1.371214 2.776363 0.743257 NaN 0.432 0.596690
1561 Zimbabwe 2017 3.638300 7.538187 0.754147 52.674484 0.752826 -0.066005 0.751208 0.806428 0.224051 0.682647 NaN NaN 2.656848 0.730244 NaN 0.432 0.581484

1562 rows × 19 columns

In [9]:
Grade cell: cell-41eb2114195d336c Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(happy_df_csv)).encode("utf-8")+b"81065").hexdigest() == "6f71e45e0e2d84898f4a471e7916ad73c386411b", "type of type(happy_df_csv) is not correct"

assert sha1(str(type(happy_df_csv.shape)).encode("utf-8")+b"81066").hexdigest() == "094e3f18671b02275683f1e17592d79d6aa83f7a", "type of happy_df_csv.shape is not tuple. happy_df_csv.shape should be a tuple"
assert sha1(str(len(happy_df_csv.shape)).encode("utf-8")+b"81066").hexdigest() == "ef5b468a1e1c0d1b9304f4bbc48d4669ade3b1fa", "length of happy_df_csv.shape is not correct"
assert sha1(str(sorted(map(str, happy_df_csv.shape))).encode("utf-8")+b"81066").hexdigest() == "544a991eaafc9defc507f2af5f42093f6c0880d9", "values of happy_df_csv.shape are not correct"
assert sha1(str(happy_df_csv.shape).encode("utf-8")+b"81066").hexdigest() == "598f82ab5376c9088eaa257763f095eca724038f", "order of elements of happy_df_csv.shape is not correct"

assert sha1(str(type(sum(happy_df_csv.year))).encode("utf-8")+b"81067").hexdigest() == "bce63f8b3da81fff51fbbfcbc2407b5fd7d16b4e", "type of sum(happy_df_csv.year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(happy_df_csv.year)).encode("utf-8")+b"81067").hexdigest() == "bbb458109f14d9b7745890a7dae34f0f796b0777", "value of sum(happy_df_csv.year) is not correct"

print('Success!')
Success!

Question 1.3.1
{points: 1}

Above, you loaded the data from a file we already downloaded and converted to a .csv for you. But pandas can also read Excel files directly into Python. Given that the data we loaded above (WHR2018Chapter2OnlineData.csv) was originally sourced from an Excel file on the web, let's now read that Excel file directly using the read_excel function from the pandas package. This Excel file has multiple sheets; the data we want is on the first one.

To answer the question, fill in the blanks in the code below. If you are unsure, try reading the documentation for the new functions and ask others for help!

Assign the data into an object called happy_df.

In [10]:
Student's answer
url = "https://github.com/UBC-DSCI/dsci-100-student/raw/refs/heads/master/data/reading/WHR2018Chapter2OnlineData.xls"

happy_df = pd.read_excel(url, sheet_name = "Table2.1")
happy_df
Out[10]:
country year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect Confidence in national government Democratic Quality Delivery Quality Standard deviation of ladder by country-year Standard deviation/Mean of ladder by country-year GINI index (World Bank estimate) GINI index (World Bank estimate), average 2000-15 gini of household income reported in Gallup, by wp5-year
0 Afghanistan 2008 3.723590 7.168690 0.450662 49.209663 0.718114 0.181819 0.881686 0.517637 0.258195 0.612072 -1.929690 -1.655084 1.774662 0.476600 NaN NaN NaN
1 Afghanistan 2009 4.401778 7.333790 0.552308 49.624432 0.678896 0.203614 0.850035 0.583926 0.237092 0.611545 -2.044093 -1.635025 1.722688 0.391362 NaN NaN 0.441906
2 Afghanistan 2010 4.758381 7.386629 0.539075 50.008961 0.600127 0.137630 0.706766 0.618265 0.275324 0.299357 -1.991810 -1.617176 1.878622 0.394803 NaN NaN 0.327318
3 Afghanistan 2011 3.831719 7.415019 0.521104 50.367298 0.495901 0.175329 0.731109 0.611387 0.267175 0.307386 -1.919018 -1.616221 1.785360 0.465942 NaN NaN 0.336764
4 Afghanistan 2012 3.782938 7.517126 0.520637 50.709263 0.530935 0.247159 0.775620 0.710385 0.267919 0.435440 -1.842996 -1.404078 1.798283 0.475367 NaN NaN 0.344540
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1557 Zimbabwe 2013 4.690188 7.565154 0.799274 48.949745 0.575884 -0.076716 0.830937 0.711885 0.182288 0.527755 -1.026085 -1.526321 1.964805 0.418918 NaN 0.432 0.555439
1558 Zimbabwe 2014 4.184451 7.562753 0.765839 50.051235 0.642034 -0.045885 0.820217 0.725214 0.239111 0.566209 -0.985267 -1.484067 2.079248 0.496899 NaN 0.432 0.601080
1559 Zimbabwe 2015 3.703191 7.556052 0.735800 50.925652 0.667193 -0.094585 0.810457 0.715079 0.178861 0.590012 -0.893078 -1.357514 2.198865 0.593776 NaN 0.432 0.655137
1560 Zimbabwe 2016 3.735400 7.538829 0.768425 51.800068 0.732971 -0.065283 0.723612 0.737636 0.208555 0.699344 -0.863044 -1.371214 2.776363 0.743257 NaN 0.432 0.596690
1561 Zimbabwe 2017 3.638300 7.538187 0.754147 52.674484 0.752826 -0.066005 0.751208 0.806428 0.224051 0.682647 NaN NaN 2.656848 0.730244 NaN 0.432 0.581484

1562 rows × 19 columns

In [11]:
Grade cell: cell-476734f8979f1d94 Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(happy_df)).encode("utf-8")+b"1020c").hexdigest() == "1f10fbe3d8b406d12e133dc9fd7998488a6d4077", "type of type(happy_df) is not correct"

assert sha1(str(type(happy_df.shape)).encode("utf-8")+b"1020d").hexdigest() == "8f839a49fff0377b46b51ff1ab9cbc2d14b0d51b", "type of happy_df.shape is not tuple. happy_df.shape should be a tuple"
assert sha1(str(len(happy_df.shape)).encode("utf-8")+b"1020d").hexdigest() == "a607346270a5f4307050520e13ece51fcbfc73cf", "length of happy_df.shape is not correct"
assert sha1(str(sorted(map(str, happy_df.shape))).encode("utf-8")+b"1020d").hexdigest() == "3fb0e9dd92592d62d7812e35e2ad9b95d889933a", "values of happy_df.shape are not correct"
assert sha1(str(happy_df.shape).encode("utf-8")+b"1020d").hexdigest() == "f27ad6fd652e254a75b49ce18681547e22da69f5", "order of elements of happy_df.shape is not correct"

assert sha1(str(type(sum(happy_df.year))).encode("utf-8")+b"1020e").hexdigest() == "8a256b236c83b59a7b768fa7411894136a71f551", "type of sum(happy_df.year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(happy_df.year)).encode("utf-8")+b"1020e").hexdigest() == "d4aa9311b59f16d7f442b1ed96943b70d9f400d7", "value of sum(happy_df.year) is not correct"

print('Success!')
Success!

Look at the column names - they contain spaces! This is not a best practice for column names. Run the cell below to replace all the spaces with an underscore (_). The columns attribute is used to access the data frame's column names.

*Note: Since every column name is a string, we can call the replace string method on each column name (the .str accessor applies it to every name at once).

In [12]:
Student's answer
happy_df.columns = happy_df.columns.str.replace(" ", "_")
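To see what that line does, here is the same idea on a throwaway data frame (the names below are invented for illustration):

```python
import pandas as pd

demo = pd.DataFrame({"Life Ladder": [1.0], "Social support": [2.0]})

# .columns is an Index of the column names; the .str accessor
# applies the string replace to every name at once
demo.columns = demo.columns.str.replace(" ", "_")
```

After this, `demo.columns` holds `Life_Ladder` and `Social_support`.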

Question 1.3.2
{points: 1}

Using the scaffolding given in the cell below, apply [], assign, and [] (again!) to the happy_df data frame as needed to get it ready to create our desired scatterplot. Recall that we wanted to rescale the "Positive affect" scores so that they fall in the range 0-10 instead of 0-1. Call the new, re-scaled column Positive_affect_scaled.

Assign the data frame containing only the columns we need to create our plot to an object called reduced_happy_df.

In [13]:
Student's answer
happy_step1 = happy_df[happy_df['year'] == 2017]
happy_step2 = happy_step1.assign(Positive_affect_scaled = happy_step1["Positive_affect"] * 10)
reduced_happy_df = happy_step2[["Positive_affect_scaled", "Healthy_life_expectancy_at_birth"]]

reduced_happy_df
Out[13]:
Positive_affect_scaled Healthy_life_expectancy_at_birth
9 4.963486 52.339527
19 6.692409 69.051659
25 6.419796 65.699188
41 8.094226 67.538704
53 6.250138 65.125687
... ... ...
1505 8.389891 63.166180
1528 NaN 66.245766
1538 4.551819 54.981033
1549 6.846233 53.803783
1561 8.064284 52.674484

141 rows × 2 columns

In [14]:
Grade cell: cell-dec4163f6664b7f2 Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(happy_step1 is None)).encode("utf-8")+b"9a4bb").hexdigest() == "785a94c80b68e1f57cb0bc1a71a7bb04f243e137", "type of happy_step1 is None is not bool. happy_step1 is None should be a bool"
assert sha1(str(happy_step1 is None).encode("utf-8")+b"9a4bb").hexdigest() == "c14ce643bab8a8fb0e9d079364879af983f284d7", "boolean value of happy_step1 is None is not correct"

assert sha1(str(type(happy_step2 is None)).encode("utf-8")+b"9a4bc").hexdigest() == "c985bc2b3343a9fe26a170f069640edef7865a89", "type of happy_step2 is None is not bool. happy_step2 is None should be a bool"
assert sha1(str(happy_step2 is None).encode("utf-8")+b"9a4bc").hexdigest() == "a63dabb6cf9af53a6e57301426fa178a67391a37", "boolean value of happy_step2 is None is not correct"

assert sha1(str(type(reduced_happy_df is None)).encode("utf-8")+b"9a4bd").hexdigest() == "3398ec32080eddaf05865968ae0c09d1653cbaa0", "type of reduced_happy_df is None is not bool. reduced_happy_df is None should be a bool"
assert sha1(str(reduced_happy_df is None).encode("utf-8")+b"9a4bd").hexdigest() == "f25a36e171dc5e00acdc6cd3c84ef3c23b63dd9c", "boolean value of reduced_happy_df is None is not correct"

assert sha1(str(type(reduced_happy_df.shape)).encode("utf-8")+b"9a4be").hexdigest() == "cac32ad02b0b44893f1e261b4efbbfda8deec224", "type of reduced_happy_df.shape is not tuple. reduced_happy_df.shape should be a tuple"
assert sha1(str(len(reduced_happy_df.shape)).encode("utf-8")+b"9a4be").hexdigest() == "2d3cda6077cd2b468638c811df047a34120ea2a0", "length of reduced_happy_df.shape is not correct"
assert sha1(str(sorted(map(str, reduced_happy_df.shape))).encode("utf-8")+b"9a4be").hexdigest() == "dc0188f1ff6a016373353a9c8404481f96035eb6", "values of reduced_happy_df.shape are not correct"
assert sha1(str(reduced_happy_df.shape).encode("utf-8")+b"9a4be").hexdigest() == "346dc91ab22044def37c855d7caf91db46a79681", "order of elements of reduced_happy_df.shape is not correct"

assert sha1(str(type(happy_step1.year.unique())).encode("utf-8")+b"9a4bf").hexdigest() == "6e60d6c6310836eb31a59a444aaf75bfbce96e7b", "type of happy_step1.year.unique() is not correct"
assert sha1(str(happy_step1.year.unique()).encode("utf-8")+b"9a4bf").hexdigest() == "e1ec8fada5a449b9eb90bbf82857250c15cc8582", "value of happy_step1.year.unique() is not correct"

assert sha1(str(type(sum(reduced_happy_df.Positive_affect_scaled.dropna()))).encode("utf-8")+b"9a4c0").hexdigest() == "b8737dde77e6164f4a073d0430445b5fdb7103de", "type of sum(reduced_happy_df.Positive_affect_scaled.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(reduced_happy_df.Positive_affect_scaled.dropna()), 2)).encode("utf-8")+b"9a4c0").hexdigest() == "e75fcff8e73c2ad22195e184f4b847b95173829f", "value of sum(reduced_happy_df.Positive_affect_scaled.dropna()) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(reduced_happy_df.Healthy_life_expectancy_at_birth.dropna()))).encode("utf-8")+b"9a4c1").hexdigest() == "9efb151ffb62be308d929fb376104c94fe982083", "type of sum(reduced_happy_df.Healthy_life_expectancy_at_birth.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(reduced_happy_df.Healthy_life_expectancy_at_birth.dropna()), 2)).encode("utf-8")+b"9a4c1").hexdigest() == "0f24f296e96f3584b3f2b193c118c96c5ce463ca", "value of sum(reduced_happy_df.Healthy_life_expectancy_at_birth.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!

Question 1.4
{points: 1}

Using the modified data set, reduced_happy_df, generate the scatterplot described above and make sure to label the axes in proper written English. We have included scale(zero=False) in the code so that the plot isn't fixed to start at zero.

Assign your plot to an object called happy_plot.

In [15]:
Student's answer
happy_plot = alt.Chart(reduced_happy_df).mark_point().encode(
    x=alt.X("Healthy_life_expectancy_at_birth")
        .title("Healthy life expectancy at birth (Years)")
        .scale(zero=False),
    y=alt.Y("Positive_affect_scaled")
        .title("Positive affect score (out of 10)")
        .scale(zero=False)
)

happy_plot
Out[15]:
In [16]:
Grade cell: cell-cc81148d52ca3d36 Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(happy_plot.encoding.x['shorthand'])).encode("utf-8")+b"240bf").hexdigest() == "5501ea0bc5051f491d540b1e1f1f8516206f64a0", "type of happy_plot.encoding.x['shorthand'] is not str. happy_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(happy_plot.encoding.x['shorthand'])).encode("utf-8")+b"240bf").hexdigest() == "f12d2d4d180bf75894d4b8b023f4e1021237a198", "length of happy_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(happy_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"240bf").hexdigest() == "a5a449b54c828a1996d19cc84f98152548f1595c", "value of happy_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(happy_plot.encoding.x['shorthand']).encode("utf-8")+b"240bf").hexdigest() == "012ebc6773c8eaa9f7529e4141c980f153c43bf2", "correct string value of happy_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(happy_plot.encoding.y['shorthand'])).encode("utf-8")+b"240c0").hexdigest() == "032d5959c58fe8b10a57ec8e389a82abaec352e9", "type of happy_plot.encoding.y['shorthand'] is not str. happy_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(happy_plot.encoding.y['shorthand'])).encode("utf-8")+b"240c0").hexdigest() == "48740d144e1c1e32ea6ba25ca938820b493446c6", "length of happy_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(happy_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"240c0").hexdigest() == "1cc80b8a26d84b16c78b958f9cf2367c5cfd5d46", "value of happy_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(happy_plot.encoding.y['shorthand']).encode("utf-8")+b"240c0").hexdigest() == "fb9fe9d3f723cb0c6c2b05eda5e8dc7196d7324a", "correct string value of happy_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(happy_plot.mark)).encode("utf-8")+b"240c1").hexdigest() == "b91cbf70fa314673f13cdd4aaecbadecf3b3bc3c", "type of happy_plot.mark is not str. happy_plot.mark should be an str"
assert sha1(str(len(happy_plot.mark)).encode("utf-8")+b"240c1").hexdigest() == "fd0a497027579f563ed4563d72c03cc08dbe33f4", "length of happy_plot.mark is not correct"
assert sha1(str(happy_plot.mark.lower()).encode("utf-8")+b"240c1").hexdigest() == "c88a9c47782507e5f0d3f70216d77ce88f3e09f2", "value of happy_plot.mark is not correct"
assert sha1(str(happy_plot.mark).encode("utf-8")+b"240c1").hexdigest() == "c88a9c47782507e5f0d3f70216d77ce88f3e09f2", "correct string value of happy_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(happy_plot.encoding.x['title'], str))).encode("utf-8")+b"240c2").hexdigest() == "08a50759f1734a2278d1b2703748e308506405dd", "type of isinstance(happy_plot.encoding.x['title'], str) is not bool. isinstance(happy_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(happy_plot.encoding.x['title'], str)).encode("utf-8")+b"240c2").hexdigest() == "40131500c79785a2e939089e7b33cf568e37a443", "boolean value of isinstance(happy_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(happy_plot.encoding.y['title'], str))).encode("utf-8")+b"240c3").hexdigest() == "fbe9ef33eb133d5116f217f6ca76856cf624d0e4", "type of isinstance(happy_plot.encoding.y['title'], str) is not bool. isinstance(happy_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(happy_plot.encoding.y['title'], str)).encode("utf-8")+b"240c3").hexdigest() == "62d55e8d1037c5792b8c68fefdcf3816d354a9f6", "boolean value of isinstance(happy_plot.encoding.y['title'], str) is not correct"

print('Success!')
Success!

Question 1.5
{points: 3}

In one or two sentences, describe what you see in the scatterplot above. Does there appear to be a relationship between life expectancy at birth and positive affect? If so, describe it.

Student's answer Score: 3.0 / 3.0

There does not seem to be a relationship between life expectancy and positive affect, as there is a great deal of variability in the data.

Question 1.6
{points: 3}

Plot the perceptions of corruption against healthy life expectancy at birth (using the unmodified happy_df data for the chart). You should NOT scale the variables to be plotted. Ensure that healthy life expectancy at birth is on the x-axis and that you give your axes human-readable labels.

Assign your plot to an object called happy_plot_2.

In [17]:
Student's answer
happy_plot_2 = alt.Chart(happy_df).mark_point().encode(
    x=alt.X("Healthy_life_expectancy_at_birth")
        .title("Healthy life expectancy at birth (Years)")
        .scale(zero=False),
    y=alt.Y("Perceptions_of_corruption").title("Perceptions of corruption")
)
happy_plot_2
Out[17]:
In [18]:
Grade cell: cell-4770f90dc16ee481 Score: 3.0 / 3.0
from hashlib import sha1
assert sha1(str(type(happy_plot_2.mark)).encode("utf-8")+b"73b0c").hexdigest() == "3459c71fc33b1c2cfd22085e2b4e55fc37606255", "type of happy_plot_2.mark is not str. happy_plot_2.mark should be an str"
assert sha1(str(len(happy_plot_2.mark)).encode("utf-8")+b"73b0c").hexdigest() == "4a1233cfe68f36f670967f80d9b334d4bcde70f5", "length of happy_plot_2.mark is not correct"
assert sha1(str(happy_plot_2.mark.lower()).encode("utf-8")+b"73b0c").hexdigest() == "832a2078abd834964b786d4ec8d461e3900e04d7", "value of happy_plot_2.mark is not correct"
assert sha1(str(happy_plot_2.mark).encode("utf-8")+b"73b0c").hexdigest() == "832a2078abd834964b786d4ec8d461e3900e04d7", "correct string value of happy_plot_2.mark but incorrect case of letters"

assert sha1(str(type(happy_plot_2.encoding.x['shorthand'])).encode("utf-8")+b"73b0d").hexdigest() == "544f409ad2e6d7c06a95da933c2fe5a747986f56", "type of happy_plot_2.encoding.x['shorthand'] is not str. happy_plot_2.encoding.x['shorthand'] should be an str"
assert sha1(str(len(happy_plot_2.encoding.x['shorthand'])).encode("utf-8")+b"73b0d").hexdigest() == "6169bdff2fa9cd0926d09abd3d1a3c3e54514b5c", "length of happy_plot_2.encoding.x['shorthand'] is not correct"
assert sha1(str(happy_plot_2.encoding.x['shorthand'].lower()).encode("utf-8")+b"73b0d").hexdigest() == "f83f3ed1cf820c54394d090f50dac993f09de6ba", "value of happy_plot_2.encoding.x['shorthand'] is not correct"
assert sha1(str(happy_plot_2.encoding.x['shorthand']).encode("utf-8")+b"73b0d").hexdigest() == "54e51eeab15d13155250ea86753d6ed5fa2e6344", "correct string value of happy_plot_2.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(happy_plot_2.encoding.y['shorthand'] != 'Healthy_life_expectancy_at_birth')).encode("utf-8")+b"73b0e").hexdigest() == "d8b973b70ad7d27e542db724dfc90398d8d8ad68", "type of happy_plot_2.encoding.y['shorthand'] != 'Healthy_life_expectancy_at_birth' is not bool. happy_plot_2.encoding.y['shorthand'] != 'Healthy_life_expectancy_at_birth' should be a bool"
assert sha1(str(happy_plot_2.encoding.y['shorthand'] != 'Healthy_life_expectancy_at_birth').encode("utf-8")+b"73b0e").hexdigest() == "fc957ed694530d42eb968dfce600c05924cafb63", "boolean value of happy_plot_2.encoding.y['shorthand'] != 'Healthy_life_expectancy_at_birth' is not correct"

assert sha1(str(type(happy_plot_2.encoding.y['shorthand'] != 'Positive_affect_scaled')).encode("utf-8")+b"73b0f").hexdigest() == "41cb43147648d5d4c9717baf7fbb55315b4df787", "type of happy_plot_2.encoding.y['shorthand'] != 'Positive_affect_scaled' is not bool. happy_plot_2.encoding.y['shorthand'] != 'Positive_affect_scaled' should be a bool"
assert sha1(str(happy_plot_2.encoding.y['shorthand'] != 'Positive_affect_scaled').encode("utf-8")+b"73b0f").hexdigest() == "a54cad90a69f02c59f86f776be94bac70312cf04", "boolean value of happy_plot_2.encoding.y['shorthand'] != 'Positive_affect_scaled' is not correct"

assert sha1(str(type(happy_plot_2.encoding.y['shorthand'] in happy_df.columns)).encode("utf-8")+b"73b10").hexdigest() == "fb18a8f95f6ad4cb43975119841312c4162b3e0b", "type of happy_plot_2.encoding.y['shorthand'] in happy_df.columns is not bool. happy_plot_2.encoding.y['shorthand'] in happy_df.columns should be a bool"
assert sha1(str(happy_plot_2.encoding.y['shorthand'] in happy_df.columns).encode("utf-8")+b"73b10").hexdigest() == "dcade83be91cd7399ec5d200718d318ff5096a29", "boolean value of happy_plot_2.encoding.y['shorthand'] in happy_df.columns is not correct"

assert sha1(str(type(isinstance(happy_plot_2.encoding.x['title'], (str, list)))).encode("utf-8")+b"73b11").hexdigest() == "5a150cf41d3cd4e70a4f5a9144ff5d9beacdc5ef", "type of isinstance(happy_plot_2.encoding.x['title'], (str, list)) is not bool. isinstance(happy_plot_2.encoding.x['title'], (str, list)) should be a bool"
assert sha1(str(isinstance(happy_plot_2.encoding.x['title'], (str, list))).encode("utf-8")+b"73b11").hexdigest() == "7d60383fb77516d2133cbe2754dbf54fd5744259", "boolean value of isinstance(happy_plot_2.encoding.x['title'], (str, list)) is not correct"

assert sha1(str(type(isinstance(happy_plot_2.encoding.y['title'], (str, list)))).encode("utf-8")+b"73b12").hexdigest() == "ed845346852755609c297989e5b27a8487724271", "type of isinstance(happy_plot_2.encoding.y['title'], (str, list)) is not bool. isinstance(happy_plot_2.encoding.y['title'], (str, list)) should be a bool"
assert sha1(str(isinstance(happy_plot_2.encoding.y['title'], (str, list))).encode("utf-8")+b"73b12").hexdigest() == "a6e0de0171b40e496563ea4ec4bb447d896da832", "boolean value of isinstance(happy_plot_2.encoding.y['title'], (str, list)) is not correct"

print('Success!')
Success!

Question 1.7
{points: 3}

In a sentence or two, describe what you see in the scatterplot above. Does there appear to be a relationship between healthy life expectancy at birth and the other variable you plotted? If so, describe it.

Student's answer Score: 0.0 / 3.0

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

2. Whistler Snow

Skiing and snowboarding are huge in British Columbia. Some of the best slopes for snow sports are quite close. In fact, the famous mountain resort town of Whistler is just two hours north of Vancouver. With cold weather and plenty of snowfall, Whistler is an ideal destination for winter sports fanatics.

One thing skiers and snowboarders want is fresh snow! When are they most likely to find this? In the data directory, we have two year-long data sets from Environment Canada from the Whistler Roundhouse Station (on Whistler mountain). This weather station is located 1,835 m above sea level.

To answer the question "When are skiers and snowboarders most likely to find fresh snow at Whistler?" you will create a line plot with the date on the x-axis and the total snow per day in centimetres (the column named Total Snow (cm) in the data file) on the y-axis. Given that we have data for two years (2017 & 2018), we will create one plot for each year to see if there is a trend we can observe across the two years.
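A minimal sketch of that per-year split, using a few invented rows in place of the real weather files (the real data has one row per day and many more columns):

```python
import pandas as pd

# Invented stand-in rows for the Whistler weather data
snow = pd.DataFrame({
    "Date/Time": ["2017-12-30", "2017-12-31", "2018-01-01"],
    "Total Snow (cm)": [5.0, 0.0, 12.0],
})

# Parse the dates so they sort and plot as dates, not strings
snow["Date/Time"] = pd.to_datetime(snow["Date/Time"])

# One data frame per year, ready for one line plot each
snow_2017 = snow[snow["Date/Time"].dt.year == 2017]
snow_2018 = snow[snow["Date/Time"].dt.year == 2018]
```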

Question 2.1 Multiple Choice:
{points: 1}

What are we going to plot on the y-axis?

A. total precipitation per day in centimetres

B. total snow on the ground in centimetres

C. total snow per day in centimetres

D. total rain per day in centimetres

Assign your answer to an object called answer2_1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [19]:
Student's answer
answer2_1 = "C"
In [20]:
Grade cell: cell-09cb9f902ccc6326 Score: 1.0 / 1.0
from hashlib import sha1
assert sha1(str(type(answer2_1)).encode("utf-8")+b"6c86d").hexdigest() == "14577187daeb60c9f292f92ffc3d11f24b10d747", "type of answer2_1 is not str. answer2_1 should be an str"
assert sha1(str(len(answer2_1)).encode("utf-8")+b"6c86d").hexdigest() == "e7e1f30207096fb5a7b0e6af542bf9027023659c", "length of answer2_1 is not correct"
assert sha1(str(answer2_1.lower()).encode("utf-8")+b"6c86d").hexdigest() == "5c363060313b02b7dc78f50fb4ff2d261c25a48e", "value of answer2_1 is not correct"
assert sha1(str(answer2_1).encode("utf-8")+b"6c86d").hexdigest() == "a32c3c5d93812962a4638db5c58a44aedc9c0f6b", "correct string value of answer2_1 but incorrect case of letters"

print('Success!')
Success!

Question 2.2.0
{points: 1}

Read in the file named eng-daily-01012018-12312018.csv from the data directory. Make sure you preview the file to choose the correct read_* function and argument values to get the data into Python.

Assign your data frame to an object called whistler_2018.

*Note: You'll see a lot of entries of the form NaN. This is the symbol pandas uses to denote missing data.
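As a quick illustration (a minimal sketch using made-up values, not the Whistler data), pandas marks missing entries as NaN, and the isna method makes them easy to count:

```python
import pandas as pd
import numpy as np

# A tiny stand-in frame with a deliberately missing value
snow = pd.DataFrame({"Total Snow (cm)": [0.0, np.nan, 2.5]})

# isna() flags each missing entry; sum() counts them
missing = snow["Total Snow (cm)"].isna().sum()
print(missing)
```

Note that NaN entries are skipped by default in most pandas aggregations, so sums and means still work on columns that contain them.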

In [21]:
Student's answer(Top)
import pandas as pd
whistler_2018 = pd.read_csv("data/eng-daily-01012018-12312018.csv", sep=",", skiprows=24)
whistler_2018
Out[21]:
Date/Time Year Month Day Data Quality Max Temp (°C) Max Temp Flag Min Temp (°C) Min Temp Flag Mean Temp (°C) ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
0 2018-01-01 2018 1 1 6.0 NaN -6.0 NaN 0.0 ... 0.0 NaN 0.0 NaN 187.0 NaN NaN NaN NaN NaN
1 2018-01-02 2018 1 2 5.5 NaN 0.0 NaN 2.8 ... 0.0 NaN 0.0 NaN 183.0 NaN NaN NaN NaN NaN
2 2018-01-03 2018 1 3 6.0 NaN 0.0 NaN 3.0 ... 0.0 NaN 0.0 NaN 180.0 NaN NaN NaN NaN NaN
3 2018-01-04 2018 1 4 5.0 NaN 1.5 NaN 3.3 ... 0.0 NaN 2.5 NaN 179.0 NaN NaN NaN NaN NaN
4 2018-01-05 2018 1 5 3.0 NaN 0.5 NaN 1.8 ... 0.0 T 13.4 NaN 175.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
360 2018-12-27 2018 12 27 -3.5 NaN -9.0 NaN -6.3 ... 1.0 NaN 1.0 NaN 207.0 NaN NaN NaN NaN NaN
361 2018-12-28 2018 12 28 2.0 NaN -8.0 NaN -3.0 ... 8.0 NaN 8.0 NaN 206.0 NaN NaN NaN NaN NaN
362 2018-12-29 2018 12 29 0.0 NaN -3.0 NaN -1.5 ... 9.0 NaN 9.0 NaN 206.0 NaN NaN NaN NaN NaN
363 2018-12-30 2018 12 30 -5.5 NaN -8.0 NaN -6.8 ... 0.0 NaN 0.0 NaN 205.0 NaN NaN NaN NaN NaN
364 2018-12-31 2018 12 31 -2.0 NaN -12.0 NaN -7.0 ... 1.0 NaN 1.0 NaN 198.0 NaN NaN NaN NaN NaN

365 rows × 27 columns

In [22]:
Grade cell: cell-37bb46c77b5a6b65 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(whistler_2018 is None)).encode("utf-8")+b"5ec0e").hexdigest() == "beb8924b0ddbd0ea172cd393e030e1ecc882727d", "type of whistler_2018 is None is not bool. whistler_2018 is None should be a bool"
assert sha1(str(whistler_2018 is None).encode("utf-8")+b"5ec0e").hexdigest() == "034ef6b4e9dc72de42e6a8ac93eeb707d0115bdc", "boolean value of whistler_2018 is None is not correct"

assert sha1(str(type(whistler_2018)).encode("utf-8")+b"5ec0f").hexdigest() == "f6103de6c82ea452d157122479770df0b5227f33", "type of type(whistler_2018) is not correct"

assert sha1(str(type(whistler_2018.shape)).encode("utf-8")+b"5ec10").hexdigest() == "16a64cd290a3e128138db43e74fc44239d6abc2d", "type of whistler_2018.shape is not tuple. whistler_2018.shape should be a tuple"
assert sha1(str(len(whistler_2018.shape)).encode("utf-8")+b"5ec10").hexdigest() == "f9b79589c004eb307c95ccce7a4f6e870ceef64a", "length of whistler_2018.shape is not correct"
assert sha1(str(sorted(map(str, whistler_2018.shape))).encode("utf-8")+b"5ec10").hexdigest() == "b534ce29859c0a76ee4fe5c9a79db5f3d989442f", "values of whistler_2018.shape are not correct"
assert sha1(str(whistler_2018.shape).encode("utf-8")+b"5ec10").hexdigest() == "edcaa9862443e2e50ea70d37d49a8841a4966575", "order of elements of whistler_2018.shape is not correct"

assert sha1(str(type(sum(whistler_2018['Total Snow (cm)'].dropna()))).encode("utf-8")+b"5ec11").hexdigest() == "aea4ba5a6e6713626c01e79cbf61e3e1b332b067", "type of sum(whistler_2018['Total Snow (cm)'].dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(whistler_2018['Total Snow (cm)'].dropna()), 2)).encode("utf-8")+b"5ec11").hexdigest() == "2761a7df814efc651aec7797793b96288d9d2834", "value of sum(whistler_2018['Total Snow (cm)'].dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!

Question 2.2.1
{points: 1}

Looking at the column names of the whistler_2018 data frame, you can see we have white space in our column names again. Replace the white space with _ using the str.replace method.

In [23]:
Student's answer(Top)
whistler_2018.columns = whistler_2018.columns.str.replace(" ","_")
whistler_2018.columns
Out[23]:
Index(['Date/Time', 'Year', 'Month', 'Day', 'Data_Quality', 'Max_Temp_(°C)',
       'Max_Temp_Flag', 'Min_Temp_(°C)', 'Min_Temp_Flag', 'Mean_Temp_(°C)',
       'Mean_Temp_Flag', 'Heat_Deg_Days_(°C)', 'Heat_Deg_Days_Flag',
       'Cool_Deg_Days_(°C)', 'Cool_Deg_Days_Flag', 'Total_Rain_(mm)',
       'Total_Rain_Flag', 'Total_Snow_(cm)', 'Total_Snow_Flag',
       'Total_Precip_(mm)', 'Total_Precip_Flag', 'Snow_on_Grnd_(cm)',
       'Snow_on_Grnd_Flag', 'Dir_of_Max_Gust_(10s_deg)',
       'Dir_of_Max_Gust_Flag', 'Spd_of_Max_Gust_(km/h)',
       'Spd_of_Max_Gust_Flag'],
      dtype='object')
In [24]:
Grade cell: cell-a8623de117076f50 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(whistler_2018.shape)).encode("utf-8")+b"1cf00").hexdigest() == "a7a2c6696ff01101297e30f4efe7c32d11fa5f08", "type of whistler_2018.shape is not tuple. whistler_2018.shape should be a tuple"
assert sha1(str(len(whistler_2018.shape)).encode("utf-8")+b"1cf00").hexdigest() == "cbb5b0522e44dbb3103ae76756b95d7cf2e00aee", "length of whistler_2018.shape is not correct"
assert sha1(str(sorted(map(str, whistler_2018.shape))).encode("utf-8")+b"1cf00").hexdigest() == "52fc57ff39d4c718f90da2d6a457d157fc998a4d", "values of whistler_2018.shape are not correct"
assert sha1(str(whistler_2018.shape).encode("utf-8")+b"1cf00").hexdigest() == "a6617bef68f30a628afeed5a1d64630fa8657c9d", "order of elements of whistler_2018.shape is not correct"

assert sha1(str(type(sum([" " in st for st in whistler_2018.columns.values]))).encode("utf-8")+b"1cf01").hexdigest() == "9a7fc5c61c0868dcd98b16408aa7869c3c0b058a", "type of sum([\" \" in st for st in whistler_2018.columns.values]) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum([" " in st for st in whistler_2018.columns.values])).encode("utf-8")+b"1cf01").hexdigest() == "ceea9950d338d8995481ff5b1a5bc1983a86b7e7", "value of sum([\" \" in st for st in whistler_2018.columns.values]) is not correct"

print('Success!')
Success!

Question 2.3
{points: 1}

Create a line plot with the date on the x-axis and the total snow per day (in cm) on the y-axis by filling in the ___ in the code below. Ensure you give your axes human-readable labels.

*Note: We need to add :T to the x encoding so Altair treats the date column as temporal and labels the x-axis by month.

Assign your plot to an object called whistler_2018_plot.

In [25]:
Student's answer(Top)
import altair as alt 
whistler_2018_plot = alt.Chart(whistler_2018).mark_line().encode(
    x=alt.X("Date/Time:T").title("Date"),
    y=alt.Y("Total_Snow_(cm)").title("Total Snow (cm)")
).properties( # set the height and the width for the plot
    width=600,
    height=200
)
whistler_2018_plot
Out[25]:
In [26]:
Grade cell: cell-7ba6c116f9f01fbf Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(whistler_2018_plot is None)).encode("utf-8")+b"40509").hexdigest() == "0d7d41755c16937ab31320ad44cdcd6f7b287424", "type of whistler_2018_plot is None is not bool. whistler_2018_plot is None should be a bool"
assert sha1(str(whistler_2018_plot is None).encode("utf-8")+b"40509").hexdigest() == "19412bfa539bb8154f29254b538c2cf15245663c", "boolean value of whistler_2018_plot is None is not correct"

assert sha1(str(type(whistler_2018_plot.mark)).encode("utf-8")+b"4050a").hexdigest() == "11b8e0b8bc0b0159e80b85b6ab42f92beb6c522d", "type of whistler_2018_plot.mark is not str. whistler_2018_plot.mark should be an str"
assert sha1(str(len(whistler_2018_plot.mark)).encode("utf-8")+b"4050a").hexdigest() == "acf4d64872ed2a6d411e8df152984917b66c689b", "length of whistler_2018_plot.mark is not correct"
assert sha1(str(whistler_2018_plot.mark.lower()).encode("utf-8")+b"4050a").hexdigest() == "78e2c2281d536f04de8a82c43b083d070bb9051a", "value of whistler_2018_plot.mark is not correct"
assert sha1(str(whistler_2018_plot.mark).encode("utf-8")+b"4050a").hexdigest() == "78e2c2281d536f04de8a82c43b083d070bb9051a", "correct string value of whistler_2018_plot.mark but incorrect case of letters"

assert sha1(str(type(whistler_2018_plot.encoding.x['shorthand'])).encode("utf-8")+b"4050b").hexdigest() == "d5a10f7be129a52665a317822c622765ae1ba393", "type of whistler_2018_plot.encoding.x['shorthand'] is not str. whistler_2018_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(whistler_2018_plot.encoding.x['shorthand'])).encode("utf-8")+b"4050b").hexdigest() == "95415eeee82c0a754b7ded7101f32435001795d2", "length of whistler_2018_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(whistler_2018_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"4050b").hexdigest() == "7e4d1802a0029bc40cb27fe981a188e3b4db7c77", "value of whistler_2018_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(whistler_2018_plot.encoding.x['shorthand']).encode("utf-8")+b"4050b").hexdigest() == "74dba0a68879e98671e2551f15f0020984113f23", "correct string value of whistler_2018_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(whistler_2018_plot.encoding.y['shorthand'])).encode("utf-8")+b"4050c").hexdigest() == "fd78bf301cd5dc3ba999e28e5e400c77dff25836", "type of whistler_2018_plot.encoding.y['shorthand'] is not str. whistler_2018_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(whistler_2018_plot.encoding.y['shorthand'])).encode("utf-8")+b"4050c").hexdigest() == "380227eb1b943ae708b425045c70b7cb164f4c65", "length of whistler_2018_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(whistler_2018_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"4050c").hexdigest() == "e617455521a3285dc726abadddf2eb53ac0c1e7a", "value of whistler_2018_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(whistler_2018_plot.encoding.y['shorthand']).encode("utf-8")+b"4050c").hexdigest() == "6c3269566b751c2db9893f8ae5ba6316a7f22763", "correct string value of whistler_2018_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(whistler_2018_plot.encoding.x['title'], str))).encode("utf-8")+b"4050d").hexdigest() == "1cfae0c485c964a16b2f698d2e78f44ffd5b89b6", "type of isinstance(whistler_2018_plot.encoding.x['title'], str) is not bool. isinstance(whistler_2018_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(whistler_2018_plot.encoding.x['title'], str)).encode("utf-8")+b"4050d").hexdigest() == "6a89bfd36650c5a482aecf0d18e6c6c39f13bfa2", "boolean value of isinstance(whistler_2018_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(whistler_2018_plot.encoding.y['title'], str))).encode("utf-8")+b"4050e").hexdigest() == "773c745b44c71ec553574c7f1c8d02ff5b504619", "type of isinstance(whistler_2018_plot.encoding.y['title'], str) is not bool. isinstance(whistler_2018_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(whistler_2018_plot.encoding.y['title'], str)).encode("utf-8")+b"4050e").hexdigest() == "cc97e03ed5b4dd7c715e03781c3cfe7cc7d7235e", "boolean value of isinstance(whistler_2018_plot.encoding.y['title'], str) is not correct"

print('Success!')
Success!

Question 2.4
{points: 3}

Looking at the line plot above, for 2018, of the months when it snowed, which 2 months had the most fresh snow?

Student's answer Score: 3.0 / 3.0 (Top)

January and December

Question 2.5
{points: 3}

Repeat the data loading and plot creation using the file eng-daily-01012017-12312017.csv located in the data directory to visualize the same data for the year 2017.

Assign your plot to an object called whistler_2017_plot.

In [27]:
Student's answer(Top)
whistler_2017 = pd.read_csv("data/eng-daily-01012017-12312017.csv", sep=",", skiprows=23)
whistler_2017.columns = [colname.replace(" ", "_") for colname in whistler_2017.columns] # replace spaces in the column names
whistler_2017_plot = alt.Chart(whistler_2017).mark_line().encode(
    x=alt.X("Date/Time:T").title("Date"),
    y=alt.Y("Total_Snow_(cm)").title("Total snow (cm)")
).properties( # set the height and the width for the plot
    width=600,
    height=200
)

whistler_2017_plot
Out[27]:
In [28]:
Grade cell: cell-7608e53d69890bf2 Score: 1.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(whistler_2017_plot is None)).encode("utf-8")+b"156ae").hexdigest() == "e5cb555a2b6e4b7b85ef9e32efc66bc4dc62d9e3", "type of whistler_2017_plot is None is not bool. whistler_2017_plot is None should be a bool"
assert sha1(str(whistler_2017_plot is None).encode("utf-8")+b"156ae").hexdigest() == "8fcfb6ddec947a1f7a3ab2cc722371651450edb0", "boolean value of whistler_2017_plot is None is not correct"

assert sha1(str(type(whistler_2017_plot.mark)).encode("utf-8")+b"156af").hexdigest() == "9925ef0bd1a148b1233b4a91796b671a58fedc1f", "type of whistler_2017_plot.mark is not str. whistler_2017_plot.mark should be an str"
assert sha1(str(len(whistler_2017_plot.mark)).encode("utf-8")+b"156af").hexdigest() == "67a7a3262de12c5b46b1cb06e87c88bf55c4247c", "length of whistler_2017_plot.mark is not correct"
assert sha1(str(whistler_2017_plot.mark.lower()).encode("utf-8")+b"156af").hexdigest() == "0f22c9aec9f48a61d101d1b306e7d7ace1b624a8", "value of whistler_2017_plot.mark is not correct"
assert sha1(str(whistler_2017_plot.mark).encode("utf-8")+b"156af").hexdigest() == "0f22c9aec9f48a61d101d1b306e7d7ace1b624a8", "correct string value of whistler_2017_plot.mark but incorrect case of letters"

assert sha1(str(type(whistler_2017_plot.encoding.x['shorthand'])).encode("utf-8")+b"156b0").hexdigest() == "730ab1e9c9dbc36b1731e60966af0deaab964de7", "type of whistler_2017_plot.encoding.x['shorthand'] is not str. whistler_2017_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(whistler_2017_plot.encoding.x['shorthand'])).encode("utf-8")+b"156b0").hexdigest() == "f553330d9a90ebc12bf6cd3d028e2b2ee0905dd5", "length of whistler_2017_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(whistler_2017_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"156b0").hexdigest() == "ae4461d9a48f26384653bab3ba884915a5653962", "value of whistler_2017_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(whistler_2017_plot.encoding.x['shorthand']).encode("utf-8")+b"156b0").hexdigest() == "83bec3b45dfa025a24af8df2c7b153eb308a0d4a", "correct string value of whistler_2017_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(whistler_2017_plot.encoding.y['shorthand'])).encode("utf-8")+b"156b1").hexdigest() == "2537428a7fd97eb50949d9bc3f5df844b615b82e", "type of whistler_2017_plot.encoding.y['shorthand'] is not str. whistler_2017_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(whistler_2017_plot.encoding.y['shorthand'])).encode("utf-8")+b"156b1").hexdigest() == "330cef88eebb98455fe477baab4650670f812729", "length of whistler_2017_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(whistler_2017_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"156b1").hexdigest() == "4a26b8dad39c1ab62d013eb34b1c33242c113e09", "value of whistler_2017_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(whistler_2017_plot.encoding.y['shorthand']).encode("utf-8")+b"156b1").hexdigest() == "0035f65d4844948e4ae5a7560a61bbf836e7b3e4", "correct string value of whistler_2017_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(whistler_2017_plot.encoding.x['title'], str))).encode("utf-8")+b"156b2").hexdigest() == "56c10193fe0b16eb5fb133f941ccb12168184046", "type of isinstance(whistler_2017_plot.encoding.x['title'], str) is not bool. isinstance(whistler_2017_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(whistler_2017_plot.encoding.x['title'], str)).encode("utf-8")+b"156b2").hexdigest() == "5680d24878ae08c6ede0b2ed0725842ad965db7b", "boolean value of isinstance(whistler_2017_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(whistler_2017_plot.encoding.y['title'], str))).encode("utf-8")+b"156b3").hexdigest() == "85b652dbeeb0aa69dfc5794e676e73b4ae55da85", "type of isinstance(whistler_2017_plot.encoding.y['title'], str) is not bool. isinstance(whistler_2017_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(whistler_2017_plot.encoding.y['title'], str)).encode("utf-8")+b"156b3").hexdigest() == "16610707026dc884332f8190bbce34f926be9951", "boolean value of isinstance(whistler_2017_plot.encoding.y['title'], str) is not correct"

print('Success!')
Success!

Question 2.6
{points: 3}

Looking at the line plot above, for 2017, of the months when it snowed, which 2 months had the most fresh snow?

Student's answer Score: 3.0 / 3.0 (Top)

March and November

Question 2.7
{points: 3}

Are the months with the most fresh snow the same in 2017 as they were in 2018? Hint: you might want to add a code cell where you plot the two plots right after each other so you can easily compare them in one screen view.

Is there any advantage to looking at 2 years' worth of data? Why or why not?

Student's answer Score: 3.0 / 3.0 (Top)

No, the months with the most fresh snow are not the same as they were across the two years. In 2017, March and November had the most fresh snow (as indicated by the highest levels of fresh snow across the 12 months), and in 2018, January and December had the most fresh snow. There is an advantage to looking at 2 years of data because it provides some comparison and could show us errors or variability in results.

In [29]:
whistler_2018_plot = alt.Chart(whistler_2018).mark_line().encode(
    x=alt.X("Date/Time:T").title("Date"),
    y=alt.Y("Total_Snow_(cm)").title("Total Snow (cm)")
).properties( # set the height and the width for the plot
    width=600,
    height=200
)
whistler_2018_plot
Out[29]:
In [30]:
whistler_2017 = pd.read_csv("data/eng-daily-01012017-12312017.csv", sep=",", skiprows=23)
whistler_2017.columns = [colname.replace(" ", "_") for colname in whistler_2017.columns] # replace spaces in the column names
whistler_2017_plot = alt.Chart(whistler_2017).mark_line().encode(
    x=alt.X("Date/Time:T").title("Date"),
    y=alt.Y("Total_Snow_(cm)").title("Total snow (cm)")
).properties( # set the height and the width for the plot
    width=600,
    height=200
)

whistler_2017_plot
Out[30]:

3. Reading from a Database

In worksheet_reading, you'll recall that we opened a database stored in a .db file. This involved a lot more effort than just opening a .csv, .tsv, or any of the other plaintext / Excel formats.

Why should we bother with databases at all?

Databases become really useful in a large-scale setting:

  • they enable storing large datasets across multiple computers with automatic redundancy and backups
  • they enable multiple users to access them simultaneously and remotely without conflicts and errors
  • they provide mechanisms for ensuring data integrity and validating input
  • they provide security to keep data safe

For example: there are around 4 billion Google searches conducted daily as of 2019. Can you imagine if Google stored all of the data from those queries in a single .csv file!? Chaos would ensue.

To reap the real benefits of databases, we'll need to move to a more fully powered one: PostgreSQL. We'll begin by importing the ibis package, which Python uses to talk to databases, and the altair package, which we will need for visualization.

In [31]:
### Run this cell before continuing.
import ibis
import altair as alt

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()
Out[31]:
DataTransformerRegistry.enable('vegafusion')

Question 3.0
{points: 1}

Databases are often stored remotely (i.e., not on your computer or on this JupyterHub). Your first task is to load the Kickstarter data from a PostgreSQL database stored remotely on the UBC statistics network.

URL: "dsci-100-student-py.stat.ubc.ca"

Port: 5432

Username: "dsci100"

Password: "dsci100"

Database Name: "kickstarter"

Table Name: "projects"

We've provided the code to do this below. Replace each ___ with one of the items above.

Note: As this database will be used by the entire class, you will only have read access (no write permissions).

Assign the resulting database connection object to conn and the project table data to db_table.

In [32]:
Student's answer(Top)
conn = ibis.postgres.connect(
    database = "kickstarter", 
    host = "dsci-100-student-py.stat.ubc.ca",
    port = 5432,
    user = "dsci100",
    password = "dsci100"
)
#We name this variable `db_table` to indicate that it is a database table
db_table = conn.table("projects")
In [33]:
Grade cell: cell-43c8f02a1912363c Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(type(conn))).encode("utf-8")+b"65d2b").hexdigest() == "0776cb7fd9a4ff21d14dbc43c5f365436a0ef379", "type of type(conn) is not correct"
assert sha1(str(type(conn)).encode("utf-8")+b"65d2b").hexdigest() == "d2d3c55624cfc12bca7e971eaee5fb3d394eae97", "value of type(conn) is not correct"

assert sha1(str(type(conn.list_tables())).encode("utf-8")+b"65d2c").hexdigest() == "ff0bbdf7819bcbe1d29cd54def001b855b01680b", "type of conn.list_tables() is not list. conn.list_tables() should be a list"
assert sha1(str(len(conn.list_tables())).encode("utf-8")+b"65d2c").hexdigest() == "cd35a409be35c0b04859a0b3bcae8b75289ee8c3", "length of conn.list_tables() is not correct"
assert sha1(str(sorted(map(str, conn.list_tables()))).encode("utf-8")+b"65d2c").hexdigest() == "c4c380220745d27bcbff20af81e6c600931fb188", "values of conn.list_tables() are not correct"
assert sha1(str(conn.list_tables()).encode("utf-8")+b"65d2c").hexdigest() == "c4c380220745d27bcbff20af81e6c600931fb188", "order of elements of conn.list_tables() is not correct"

assert sha1(str(type(db_table is None)).encode("utf-8")+b"65d2d").hexdigest() == "e47bc59cbec5329564c5a16ba24c1fa25c165029", "type of db_table is None is not bool. db_table is None should be a bool"
assert sha1(str(db_table is None).encode("utf-8")+b"65d2d").hexdigest() == "6b548f588803d50da4863356a071f7abb4ea4685", "boolean value of db_table is None is not correct"

assert sha1(str(type(type(db_table))).encode("utf-8")+b"65d2e").hexdigest() == "73d45e656b5a791cf9fe8c4747218d82debaf622", "type of type(db_table) is not correct"
assert sha1(str(type(db_table)).encode("utf-8")+b"65d2e").hexdigest() == "d23dec0aa1931361f98827b52c0e7e9eeeb3d259", "value of type(db_table) is not correct"

assert sha1(str(type(db_table.columns)).encode("utf-8")+b"65d2f").hexdigest() == "d095f016e289faae3240c00b783d0faf8da79a95", "type of db_table.columns is not list. db_table.columns should be a list"
assert sha1(str(len(db_table.columns)).encode("utf-8")+b"65d2f").hexdigest() == "77aceee8034d6bfe2064a26dfc657c4309e5e71d", "length of db_table.columns is not correct"
assert sha1(str(sorted(map(str, db_table.columns))).encode("utf-8")+b"65d2f").hexdigest() == "c5cfe351515b407b58c6769bd1f8d8313e9d183d", "values of db_table.columns are not correct"
assert sha1(str(db_table.columns).encode("utf-8")+b"65d2f").hexdigest() == "5eeb668fade3c11a9f98d23d57fe4d2eee39ae71", "order of elements of db_table.columns is not correct"

print('Success!')
Success!

We can now call the columns attribute to see what columns are available in the db_table table.

In [34]:
db_table.columns
Out[34]:
['id',
 'state',
 'url_project',
 'url_project_short',
 'name',
 'country',
 'creator_id',
 'location_id',
 'category_id',
 'created_at',
 'deadline',
 'launched_at',
 'goal',
 'pledged',
 'currency',
 'currency_symbol',
 'usd_pledged',
 'static_usd_rate',
 'backers_count',
 'comments_count',
 'updates_count',
 'spotlight',
 'staff_pick',
 'blurb',
 'currency_trailing_code',
 'disable_communication',
 'photo_url',
 'profile_blurb',
 'profile_id',
 'profile_name',
 'profile_project_id',
 'profile_state',
 'slug',
 'url_rewards',
 'url_updates',
 'video_id',
 'video_url_high',
 'video_url_webm',
 'video_height',
 'video_width',
 'video_status',
 'file_name',
 'last_modification',
 'deleted_comments']

Question 3.1
{points: 1}

If we want to compare pledged and goal amounts of funding over time for successful projects in the United States, which columns should we select from the table?

A. id, slug, pledged

B. pledged, goal, deadline, country

C. pledged, usd_pledged, location_id

D. currency, state, country, goal

Assign your answer to an object called answer3_1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [35]:
Student's answer(Top)
answer3_1 = "B"
In [36]:
Grade cell: cell-c984d14fba25881a Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer3_1)).encode("utf-8")+b"43ed2").hexdigest() == "76d49cc8e85798b8c7b97ee4c559a753ec5c3f46", "type of answer3_1 is not str. answer3_1 should be an str"
assert sha1(str(len(answer3_1)).encode("utf-8")+b"43ed2").hexdigest() == "cd346f0042793153f507e436b50dd03dc9180f82", "length of answer3_1 is not correct"
assert sha1(str(answer3_1.lower()).encode("utf-8")+b"43ed2").hexdigest() == "1a542e59e7b28fc16599a7f966bef3f8ce1518d9", "value of answer3_1 is not correct"
assert sha1(str(answer3_1).encode("utf-8")+b"43ed2").hexdigest() == "6da6e60c00e6fbb828c32d52456ae75ff23c8007", "correct string value of answer3_1 but incorrect case of letters"

print('Success!')
Success!

Question 3.2
{points: 1}

Now we'll visualize the data. In order to do this, we need to take the correct subset of data from the table and use altair to plot the result. Note that we make the scatter plot slightly transparent (using opacity = 0.25 in the code below) because there is so much data that it would otherwise be hard to see anything (overplotting).

In the cell below, you'll see some lines of code (currently commented out with # characters). Remove the comments and rearrange these lines of code to plot the ratio of pledged to goal funding as a function of project deadline date for all successful projects (where pledged funding is greater than goal funding) in the United States in the dataset. You don't need to add any new code, just reorder the lines we have given you.

Note: there is a lot of data to plot here, so give it a moment to display!

*Hint: you'll want to put all the dataframe manipulation functions first, and then the plotting functions afterward. To avoid being overwhelmed trying to solve all the code at once, uncomment only the code needed for one step at a time. When that step works, move on to the next.
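One of the steps you'll rearrange converts the deadline column, which is stored as Unix-epoch seconds, into proper datetimes with pd.to_datetime(..., unit="s"). A minimal, self-contained sketch of that conversion (with made-up values, not the Kickstarter data):

```python
import pandas as pd

# Unix timestamps: seconds elapsed since 1970-01-01 00:00:00 UTC
seconds = pd.Series([0, 86400])

# unit="s" tells pandas the numbers are seconds since the epoch
dates = pd.to_datetime(seconds, unit="s")
print(dates)
```

Here 86,400 seconds is exactly one day, so the second entry lands on 1970-01-02. Without unit="s", pandas would interpret the integers as nanoseconds since the epoch.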

In [37]:
Student's answer(Top)
db_unfiltered = db_table[["deadline", "pledged", "goal", "country"]]
db_filtered = db_unfiltered[
    (db_unfiltered["pledged"] > db_unfiltered["goal"])
    & (db_unfiltered["country"] == "US")
]
df = db_filtered.execute()  # Create a dataframe from the database connection
df = df.assign(
    deadline=pd.to_datetime(df["deadline"], unit="s"),
    ratio=df["pledged"] / df["goal"]
)

funding_over_time_plot = alt.Chart(df).mark_circle(opacity=0.25).encode(
    x=alt.X("deadline").title("Date"),
    y=alt.Y("ratio:Q")
    .title("Pledged Funding / Goal Funding")
    .scale(type="log", base=10)
)
funding_over_time_plot
Out[37]:
In [38]:
Grade cell: cell-52e8a2b1e3010e05 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(df.columns)).encode("utf-8")+b"2864e").hexdigest() == "fe36dc32338ceedf969636f56ba97fd95e32ad34", "type of df.columns is not correct"
assert sha1(str(df.columns).encode("utf-8")+b"2864e").hexdigest() == "0e6fa79bf9d8de69c03a4dbc058fe504ad5d1873", "value of df.columns is not correct"

assert sha1(str(type(funding_over_time_plot is None)).encode("utf-8")+b"2864f").hexdigest() == "91075adc98ad69a39616cee7ecefb718336abfbc", "type of funding_over_time_plot is None is not bool. funding_over_time_plot is None should be a bool"
assert sha1(str(funding_over_time_plot is None).encode("utf-8")+b"2864f").hexdigest() == "b38a310b1c7ccd2860723f8c7b9235e450766008", "boolean value of funding_over_time_plot is None is not correct"

assert sha1(str(type(funding_over_time_plot.mark)).encode("utf-8")+b"28650").hexdigest() == "447caf1f34665e6328d5a09731fb23b4ca7ecbc5", "type of funding_over_time_plot.mark is not correct"
assert sha1(str(funding_over_time_plot.mark).encode("utf-8")+b"28650").hexdigest() == "f16e523ea7e255da4bebd0764226edfdd3fec593", "value of funding_over_time_plot.mark is not correct"

assert sha1(str(type(funding_over_time_plot.encoding.x['shorthand'])).encode("utf-8")+b"28651").hexdigest() == "17c744de2ab9a5702c2456b69cef17119c644e3a", "type of funding_over_time_plot.encoding.x['shorthand'] is not str. funding_over_time_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(funding_over_time_plot.encoding.x['shorthand'])).encode("utf-8")+b"28651").hexdigest() == "27afe30fb7f9d4d4e1e1044424384c9bb964c085", "length of funding_over_time_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"28651").hexdigest() == "221c7b3f5e44d1130eb2ecd2977fc2da288b1592", "value of funding_over_time_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.x['shorthand']).encode("utf-8")+b"28651").hexdigest() == "221c7b3f5e44d1130eb2ecd2977fc2da288b1592", "correct string value of funding_over_time_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(funding_over_time_plot.encoding.y['shorthand'])).encode("utf-8")+b"28652").hexdigest() == "0eb1eed1ab9cc10edf64ec355ce5c6a7bb5472bc", "type of funding_over_time_plot.encoding.y['shorthand'] is not str. funding_over_time_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(funding_over_time_plot.encoding.y['shorthand'])).encode("utf-8")+b"28652").hexdigest() == "1fc65799b426ff3c27fab50ce7f58aeb078d398a", "length of funding_over_time_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"28652").hexdigest() == "6249b0f8c50424b905ecd6cb40af23b7ada86291", "value of funding_over_time_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.y['shorthand']).encode("utf-8")+b"28652").hexdigest() == "ece57890283fcce35e4cb2d3aad3b97c887bfb3b", "correct string value of funding_over_time_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(funding_over_time_plot.encoding.x['title'])).encode("utf-8")+b"28653").hexdigest() == "5659eed2597abde16afbb8a23c1dbd0a5dc72e56", "type of funding_over_time_plot.encoding.x['title'] is not str. funding_over_time_plot.encoding.x['title'] should be an str"
assert sha1(str(len(funding_over_time_plot.encoding.x['title'])).encode("utf-8")+b"28653").hexdigest() == "35ea397730f614ceab4bbb56ef3bbf5ce9cd6986", "length of funding_over_time_plot.encoding.x['title'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.x['title'].lower()).encode("utf-8")+b"28653").hexdigest() == "e2d12b2dee708091328edd1d45230e9e8c9961f2", "value of funding_over_time_plot.encoding.x['title'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.x['title']).encode("utf-8")+b"28653").hexdigest() == "abc6a3da896944c5e2cb84df108421063419f5fb", "correct string value of funding_over_time_plot.encoding.x['title'] but incorrect case of letters"

assert sha1(str(type(funding_over_time_plot.encoding.y['title'])).encode("utf-8")+b"28654").hexdigest() == "c4fffd56053a178c42c4dd211ca260e01a3317bc", "type of funding_over_time_plot.encoding.y['title'] is not str. funding_over_time_plot.encoding.y['title'] should be an str"
assert sha1(str(len(funding_over_time_plot.encoding.y['title'])).encode("utf-8")+b"28654").hexdigest() == "4f6d7fc58df4386cd503d56385d3a3f7013fb370", "length of funding_over_time_plot.encoding.y['title'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.y['title'].lower()).encode("utf-8")+b"28654").hexdigest() == "0cab9c190a4c71880c6d47c264bf26adc4674cde", "value of funding_over_time_plot.encoding.y['title'] is not correct"
assert sha1(str(funding_over_time_plot.encoding.y['title']).encode("utf-8")+b"28654").hexdigest() == "b91a8a6e76bcf931480f4f20e73f9b4d3ef13355", "correct string value of funding_over_time_plot.encoding.y['title'] but incorrect case of letters"

print('Success!')
Success!

Question 3.3
{points: 3}

Is there a relationship between the ratio of pledged/goal funding and time? If so, describe it.

Additionally, mention a pattern in the data or a characteristic of it that you may not have expected in advance.

Student's answer Score: 0.5 / 3.0 (Top)

There is no apparent relationship between the pledged/goal funding ratio and time, though the sheer number of data points makes this hard to judge: the points are so densely concentrated that any underlying correlation is difficult to discern. This heavy overplotting was a characteristic of the data I did not expect in advance.

Question 3.4
{points: 1}

Finally, we'll save the project data frame to a local file in the data/ folder called project_data.csv. Recall that we don't want to try to download and save the entire dataset (way too much data!) from the database, but only the DataFrame object named df. Make sure not to include the index numbers in the saved CSV file.

In [39]:
Student's answer(Top)
import os
df.to_csv("data/project_data.csv", index=False)
In [40]:
Grade cell: cell-38725218bfd3fa34 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(os.path.exists('data/project_data.csv'))).encode("utf-8")+b"41b72").hexdigest() == "2ccd6b9d0d2697930b02ad808e0ce72881238b47", "type of os.path.exists('data/project_data.csv') is not bool. os.path.exists('data/project_data.csv') should be a bool"
assert sha1(str(os.path.exists('data/project_data.csv')).encode("utf-8")+b"41b72").hexdigest() == "45b1cf905c3d19eda13928aa21154e1e2347d33e", "boolean value of os.path.exists('data/project_data.csv') is not correct"

assert sha1(str(type(pd.read_csv("data/project_data.csv").columns)).encode("utf-8")+b"41b73").hexdigest() == "75d848743af8b88d33764c66af840a0dd6722f23", "type of pd.read_csv(\"data/project_data.csv\").columns is not correct"
assert sha1(str(pd.read_csv("data/project_data.csv").columns).encode("utf-8")+b"41b73").hexdigest() == "67364571957648813b368cd2bb81cde27eb37ee8", "value of pd.read_csv(\"data/project_data.csv\").columns is not correct"

print('Success!')
Success!

4 (Optional). Reading Data from the Internet

Question 4.0
{points: 0}

More practice scraping! To keep ourselves out of legal hot water, we will get more practice scraping data using a website that was created for that purpose: http://books.toscrape.com/

Your task here is to scrape the prices of the science fiction novels on this page and determine the maximum, minimum and average price of science fiction novels at this bookstore. Tidy up and nicely present your results by creating a data frame called sci_fi_stats that has 2 columns, one called stats that contains the words max, min and mean and one called value that contains the calculated value for each of these.

The functions for maximum, minimum and average in Python are listed in the table below:

Calculation to perform | Function in Python
---------------------- | ------------------
maximum                | max
minimum                | min
average                | sum/len

*Note: Python doesn't have a built-in function to calculate the mean of a list, but numpy does. If you are interested, you could use np.mean instead of the sum/len combination listed above.

Some other helpful hints:

  • If you end up scraping some characters other than numbers you will have to use str.replace to remove them (similar to what we did with the commas in worksheet_02).
  • Use float to convert your character type numbers to numeric type numbers before you pass them into the max, min and sum/len functions.
  • Create a list that will go in your data frame, for example, to create a list with the values 10, 16 and 13 named ages, we would type: ages = [10, 16, 13].
  • Use the function pd.DataFrame to create the data frame from your list.
In [41]:
### Run this cell before continuing
import requests
from bs4 import BeautifulSoup
In [42]:
Student's answer(Top)
# your code here
raise NotImplementedError
sci_fi_stats
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[42], line 2
      1 # your code here
----> 2 raise NotImplementedError
      3 sci_fi_stats

NotImplementedError: 
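The cell above was left unattempted. A minimal sketch of one possible approach is shown below. Rather than fetching the live page, it parses a small hand-written HTML snippet that imitates the price markup on books.toscrape.com (each book sits in an article tag with the price in a p tag of class price_color); the prices in the snippet are illustrative, not scraped values. To run it against the real site, you would fetch the page with requests.get and pass response.text to BeautifulSoup instead.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A small stand-in for the page HTML; the real site wraps each book in
# <article class="product_pod"> with the price in <p class="price_color">.
# The three prices here are made up for illustration.
html = """
<article class="product_pod"><p class="price_color">£37.59</p></article>
<article class="product_pod"><p class="price_color">£51.33</p></article>
<article class="product_pod"><p class="price_color">£20.02</p></article>
"""

soup = BeautifulSoup(html, "html.parser")

# Scrape the price text, strip the currency symbol with str.replace,
# and convert each value to a float
prices = [float(p.get_text().replace("£", ""))
          for p in soup.find_all("p", class_="price_color")]

# Tidy the summary statistics into the requested two-column data frame
sci_fi_stats = pd.DataFrame({
    "stats": ["max", "min", "mean"],
    "value": [max(prices), min(prices), sum(prices) / len(prices)],
})
print(sci_fi_stats)
```

With the real page, the only change needed is replacing the html string with the response body, e.g. html = requests.get("http://books.toscrape.com/").text, and then scraping every catalogue page rather than a single snippet.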

Question 4.1
{points: 0}

In worksheet_02 you had practice scraping data from the web. Now that you have the skills, should you scrape that website you have been dreaming of harvesting data from? Maybe, maybe not... You should check the website's Terms of Service first and consider the application you have planned for the data after you scrape it.

List 3 websites you might be interested in scraping data from (for fun, profit, or research/education). List their URLs as part of your answer. For each website, search for their Terms of Service page. Take note if such a page exists, and if it does, provide the link to it and tell us whether or not they allow web scraping of their website.

In [ ]:
 

Bonus/optional additional readings on legalities of web scraping:

Here are two recent news stories about web scraping and their legal implications:

In [43]:
try:
    os.remove("data/WHR2018Chapter2OnlineData.xls")
except FileNotFoundError:
    pass
In [44]:
try:
    os.remove("data/project_data.csv")
except FileNotFoundError:
    pass
In [ ]: